Abstract

We performed corpus correction on a modality corpus for machine
translation by using machine-learning methods such as the
maximum-entropy method. We thus constructed a high-quality modality
corpus based on corpus correction. We compared several methods for
corpus correction in our experiments and developed an effective method
for corpus correction.

1 Introduction 

In recent years, various types of tagged corpora have been constructed
and much research using tagged corpora has been performed. However,
tagged corpora include errors, which impede the progress of research.
Therefore, the correction of errors in corpora is an important research
issue.¹
We have researched error correction in corpora by using the modality
corpus we are currently constructing.² This modality corpus consists of
supervised learning data used for research on translating Japanese
tense, aspect, and modality into English

¹ There is no previous paper on error correction in corpora. In terms of
error detection in corpora, there has been research using boosting or
anomaly detection (Abney et al., 1999; Eskin, 2000).



, kono kodomo wa aa ieba kou iu kara 
koniku-rashii 

This child always talks back to me, and 
this <v>is</v> why I <vj>hate</vj> 
him. 

d kare ga aa okubyou da to wa 
omowana-katta 

I <v>did not think</v> he was so timid. 

c aa isogashikute wa yasumu hima mo 
nai hazu da 

Such a busy man as he <v>cannot 
have</v> any spare time. 



Figure 1: Part of the modality corpus 



² This paper is the English translation of the paper (Murata et al.,
2001b). We also performed corpus correction in a morphological corpus
(Murata et al., 2000).



(Murata et al., 1999; Murata et al., 2001a). (In this paper, we regard
the word modality in the broad sense of including tense and aspect.)
Tense, aspect, and modality are known to present difficult problems in
machine translation. In traditional approaches, tense, aspect, and
modality have been translated by using manually constructed heuristic
rules. Recently, however, corpus-based approaches such as the
example-based method have also been applied. The modality corpus we
consider in this paper is necessary for machine translation based on
such corpus-based approaches.

In this paper, we describe the modality corpus in Section 2, the method
of corpus correction in Section 3, and our experiments on corpus
correction in Section 4.



2 Modality Corpus for Machine Translation

In this section, we describe the modality corpus. A part of it is shown
in Figure 1. It is composed of a Japanese-English bilingual corpus, and
each English sentence can include the following two types of tags.

• The English main verb phrase is tagged with <v>.

• The English verb phrase corresponding to the Japanese main verb
phrase is tagged with <vj>.

The symbols at the beginning of each Japanese sentence, such as "c" and
"d", indicate a category of tense, aspect, and modality for the
sentence. (For example, "c" and "d" indicate "can" and "past tense",
respectively. In this corpus, the number of examples of present tense is
large, so the symbol for present tense is a null expression (i.e., "").
The first symbol in Figure 1 is ",". This symbol is used when <vj> is
used, such that the left part indicates the category of the verb phrase
tagged with <v> and the right part indicates the category of the verb
phrase tagged with <vj>. A bare "," therefore means that both verb
phrases are in the present tense.) <vj> is tagged when the verb phrase
with <v> does not correspond to the Japanese main verb.

We use the following 34 categories for tense, aspect, and modality.
These categories are determined by the surface expressions of the
English verb phrases.

1. all combinations of {present tense, past tense}, {progressive,
not-progressive}, and {perfect, not-perfect} (8 categories)

2. imperative mood (1 category)

3. auxiliary verbs ({present tense, past tense} of "be able to",
{present tense, past tense} of "be going to", "can", "could",
{present tense, past tense} of "have to", "had better", "may",
"might", "must", "need", "ought", "shall", "should", "used to",
"will", "would") (19 categories)

4. noun phrases (1 category)

5. participial construction (1 category)

6. verb ellipsis (1 category)

7. interjection or greeting sentences (1 category)

8. the case when a Japanese main verb phrase cannot correspond to an
English verb phrase (1 category)

9. the case when tagging cannot be performed (1 category)

These categories of tense, aspect, and modality are defined on the
basis of the surface expressions of the English sentences. So, if we can
estimate the correct category from a Japanese sentence, we should be
able to translate the Japanese tense, aspect, and modality into English.
Therefore, in researching the translation of modality expressions based
on the machine-learning method, only the tags indicating the categories
of tense, aspect, and modality and the Japanese sentences are used.

We placed an order with an outside company to construct the modality
corpus according to the above conditions. We used about 40,000 example
sentences from the Kodansha Japanese-English dictionary (Shimizu and
Narita, 1975) as the bilingual corpus. The outside company performed the
tagging of <v> and the corresponding categories of modality by hand. The
company inspected the corpus more than twice, until it considered the
corpus to contain no errors at all.



3 Method of Corpus Correction 

In this section, we describe the method of correcting errors in the
modality corpus constructed by hand, as described in the previous
section. The method is to calculate the probabilities of the tags that
are the objects of error correction in a corpus, and then perform corpus
correction by using those probabilities. In this paper, we only consider
tags for modality categories, not "<v>" and "<vj>" tags.

We tested two kinds of methods for calculating the probability of each
tag: the maximum-entropy method and the decision-list method.³

• Method based on the maximum-entropy method (Ristad, 1997; Ristad,
1998)

In this method, the distribution of probabilities $p(a,b)$ is calculated
for the case when Equation (1) is satisfied and Equation (2) is
maximized, and the desired probabilities $p(a|b)$ are then calculated by
using the distribution of probabilities $p(a,b)$:

$$\sum_{a \in A,\, b \in B} p(a,b)\, g_j(a,b) = \sum_{a \in A,\, b \in B} \tilde{p}(a,b)\, g_j(a,b) \quad \text{for } \forall f_j \ (1 \le j \le k) \tag{1}$$

$$H(p) = -\sum_{a \in A,\, b \in B} p(a,b) \log\bigl(p(a,b)\bigr) \tag{2}$$

where $A$, $B$, and $F$ are sets of categories, contexts, and features
$f_j$ ($\in F$, $1 \le j \le k$), respectively; $g_j(a,b)$ is a function
defined as 1 when context $b$ has feature $f_j$ and the category is $a$,
or defined as 0 otherwise; and $\tilde{p}(a,b)$ is the occurrence rate
of $(a,b)$ in the training data.

In general, the distribution of $\tilde{p}(a,b)$ is very sparse. We
cannot use it directly, so we must estimate the true distribution of
$p(a,b)$ from the distribution of $\tilde{p}(a,b)$. We assume that the
estimated values of the frequency of each pair of category and feature
as calculated from $p(a,b)$ are the same as those from
$\tilde{p}(a,b)$. (This corresponds to Equation (1).) These estimated
values are not so sparse. We can thus use the above assumption for
calculating $p(a,b)$. Furthermore, we maximize the entropy of the
distribution of $p(a,b)$ to obtain one solution of $p(a,b)$, because
using only Equation (1) produces many solutions for $p(a,b)$.
Maximizing the entropy has the effect of making the distribution more
uniform and is known to be a good solution for data sparseness
problems.

• Method based on the decision-list method (Yarowsky, 1994)

In this method, the probability of each category is calculated by using
one of the features $f_j$ ($\in F$, $1 \le j \le k$). The probability
that produces category $a$ in context $b$ is given by the following
equation:



³ In this paper, we use the maximum-entropy method and the decision-list
method to calculate the probabilities of each tag. However, a more
accurate method could also be used to calculate the probabilities for
corpus correction.



$$p(a|b) = \tilde{p}(a|f_{max}) \tag{3}$$

such that $f_{max}$ is defined by

$$f_{max} = \arg\max_{f_j \in F} \max_{a_i \in A} \tilde{p}(a_i|f_j) \tag{4}$$

where $\tilde{p}(a_i|f_j)$ is the occurrence rate of category $a_i$ when
the context has feature $f_j$.
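Equations (3) and (4) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation used in the experiments: the function names, the representation of a training example as a (category, features) pair, and the feature strings are all hypothetical. No smoothing is applied, the arg max is restricted to features that actually occur in the context being judged, and ties between features are broken arbitrarily.

```python
from collections import Counter, defaultdict


def train_decision_list(examples):
    """Estimate the occurrence rate of each category given each feature.

    examples: iterable of (category, features) pairs (assumed format).
    Returns {feature: {category: rate}}.
    """
    counts = defaultdict(Counter)
    for category, features in examples:
        for f in set(features):
            counts[f][category] += 1
    return {
        f: {a: n / sum(c.values()) for a, n in c.items()}
        for f, c in counts.items()
    }


def decision_list_probs(model, features):
    """Equations (3)-(4): choose f_max, the feature in the context whose
    most likely category has the highest occurrence rate, and return the
    category distribution attached to that single feature."""
    known = [f for f in features if f in model]
    if not known:
        return {}  # no feature of this context was seen in training
    f_max = max(known, key=lambda f: max(model[f].values()))
    return model[f_max]
```

Because only one feature decides the outcome, a feature seen with a single category yields a probability of exactly 1 for that category, which is why many tags can end up with identical probabilities.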

In this paper, we used the following 
items as features, which are the context 
when the probabilities are calculated. (26 
(=5 + 10 + 10 + 1) features appear in each 
English sentence.) 



• The strings of 1-gram to 5-gram just 
to the left of <v> in the sentence. 

(e.g.) I <v>did not think</v> he was 
so timid. 

• The strings of 1-gram to 10-gram just 
to the right of <v>. 

(e.g.) I <v> did not think</v> he was 
so timid. 

• The strings of 1-gram to 10-gram just to the left of </v>.

(e.g.) I <v>did not think</v> he was so timid.

• The 1-gram string at the end of the sentence.

(e.g.) I <v>did not think</v> he was so timid.

When the verb phrase was divided into two 
parts, as in an interrogative sentence, the 
above extraction of features was performed 
after eliminating the words between the 
first </v> and the second <v>. 
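The feature extraction described above can be sketched as follows. This is a hedged illustration: the text does not state the n-gram unit, so the sketch assumes word tokens with punctuation split off as separate tokens, drops <vj> tags before extraction, and simply skips an n-gram when the sentence is too short to supply it (so fewer than 26 features may be returned). The function name and the feature labels are hypothetical.

```python
import re


def extract_features(sentence):
    """Return (kind, n, string) features for one tagged English sentence.

    Assumes the sentence contains a <v>...</v> span; raises ValueError
    otherwise.
    """
    # Only the <v> span is used here, so <vj> tags are removed (assumption).
    sentence = re.sub(r"</?vj>", "", sentence)
    # Split verb phrase: drop the words between the first </v> and the
    # second <v>, leaving a single <v>...</v> span.
    sentence = re.sub(r"</v>.*?<v>", " ", sentence)
    tokens = re.findall(r"</?v>|[\w']+|[^\w\s]", sentence)
    open_at = tokens.index("<v>")
    close_at = tokens.index("</v>")
    words = [t for t in tokens if t not in ("<v>", "</v>")]
    start, end = open_at, close_at - 1  # the verb phrase is words[start:end]

    feats = []
    for n in range(1, 6):  # 1-gram to 5-gram just to the left of <v>
        if start - n >= 0:
            feats.append(("left_of_open", n, " ".join(words[start - n:start])))
    for n in range(1, 11):  # 1-gram to 10-gram just to the right of <v>
        if start + n <= len(words):
            feats.append(("right_of_open", n, " ".join(words[start:start + n])))
    for n in range(1, 11):  # 1-gram to 10-gram just to the left of </v>
        if end - n >= 0:
            feats.append(("left_of_close", n, " ".join(words[end - n:end])))
    if words:  # 1-gram at the end of the sentence
        feats.append(("sentence_end", 1, words[-1]))
    return feats
```

For "I <v>did not think</v> he was so timid." this produces, among others, the left-of-<v> 1-gram "I", the right-of-<v> 3-gram "did not think", and the sentence-final 1-gram ".".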

Because the corpus used in this paper was designed for estimating the
modality of the English sentence from the Japanese sentence, one may
think that we should extract the features from the Japanese sentence.
That would be true if we wanted to infer English modalities from
Japanese sentences. What we want to do here, however, is to correct
English modality tags, so we should use all the information available.
Since the category of the modality expression of the English sentence is
tagged and the verb phrase of the English sentence is examined when the
corpus is constructed by hand, it is reasonable to use the English verb
phrase in corpus correction based on the machine-learning method.

Next, we describe the method of judging whether each tag in the corpus
is incorrect or not. We first calculate the probabilities of the
category of the tag and of the other categories. We judge that the tag
is correct when its category has the highest probability, and incorrect
when one of the other categories has the highest probability. Next, we
correct the tag if it is judged to be incorrect. This correction is
performed by changing the tag to the tag of the category with the
highest probability.⁴ (In practice, each correction is confirmed by
annotators.)

Corpus correction should be confirmed by human beings and is therefore
very time-consuming. However, when the probabilities of each tag can be
calculated, we can define a confidence value for the corpus correction,
as described below. It is thus more convenient to sort the error
candidates in the corpus by confidence value and begin by correcting the
errors for which the confidence value is higher.

We tested the following two types of methods for determining the
confidence value for corpus correction.

• Method 1: the probability of the category with the highest
probability is used as the confidence value for corpus correction.

• Method 2: the non-probability of the tag originally defined is used
as the confidence value for corpus correction.

In this paper, the non-probability is defined as the value obtained by
subtracting the probability from 1.
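The judgment rule and the two confidence values can be summarized in a short Python sketch. The function names and the dictionary layout are hypothetical; the category probabilities are assumed to come from whichever probability model is in use.

```python
def judge_tag(original, probs):
    """Judge one tag given probs = {category: probability}.

    Returns None when the original category has the highest probability
    (tag judged correct); otherwise returns a correction candidate with
    both confidence values.
    """
    best = max(probs, key=probs.get)
    if best == original:
        return None
    return {
        "original": original,
        "proposed": best,
        # Method 1: probability of the category with the highest probability.
        "method1": probs[best],
        # Method 2: non-probability (1 - p) of the originally defined tag.
        "method2": 1.0 - probs.get(original, 0.0),
    }


def rank_candidates(candidates, method="method1"):
    """Sort error candidates so that the most confident corrections come first."""
    return sorted(candidates, key=lambda c: c[method], reverse=True)
```

The two methods can order the same candidates differently: a tag whose original category has probability 0.05 while the best category has 0.5 ranks low under Method 1 (0.5) but high under Method 2 (0.95).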

Finally, we explain how the data are used for calculating the
probabilities. There are two ways of calculating the probabilities with
the machine-learning method:

• calculation of probabilities for the closed data, and

⁴ This action of corpus correction is exactly equivalent to redefining
the tag in the corpus by using a machine-learning method and re-tagging
with the newly defined tag.



• calculation of probabilities for the 
open data. 

The first method calculates the probabilities by using all the tags in
the corpus, including the tag currently being judged. The second method
does not use the tag currently being judged. In this paper, 10-fold
cross validation was used for calculating the probabilities for the open
data.⁵
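The difference between the closed and the open setting can be sketched in Python as follows. This is an illustration only: the paper does not say how the ten folds were formed, so the sketch assigns example i to fold i mod k, and `train_prior` / `predict_prior` are a deliberately trivial stand-in model (category frequencies that ignore the features) used just to make the sketch self-contained.

```python
def train_prior(examples):
    """Toy stand-in model: count how often each category occurs."""
    counts = {}
    for category, _features in examples:
        counts[category] = counts.get(category, 0) + 1
    return counts


def predict_prior(model, _features):
    """Toy stand-in model: return relative category frequencies."""
    total = sum(model.values())
    return {c: n / total for c, n in model.items()}


def closed_probs(examples, train=train_prior, predict=predict_prior):
    """Closed data: the model is trained on every tag, including the one judged."""
    model = train(examples)
    return [predict(model, feats) for _, feats in examples]


def open_probs(examples, train=train_prior, predict=predict_prior, k=10):
    """Open data: k-fold cross validation, so the tag being judged is never
    part of the training data used to score it."""
    results = [None] * len(examples)
    for fold in range(k):
        held_out = range(fold, len(examples), k)  # assumed fold assignment
        held = set(held_out)
        training = [ex for i, ex in enumerate(examples) if i not in held]
        model = train(training)
        for i in held_out:
            results[i] = predict(model, examples[i][1])
    return results
```

With one "can" tag among nine "past" tags, the closed setting gives the "can" tag probability 0.1, whereas in the open setting its category never appears in the training data for its fold and so receives no probability at all.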

4 Experiments on Corpus 
Correction 

We carried out experiments on corpus correction by using the methods
described in the previous section. These experiments were performed
after eliminating the sentences given tags indicating that tagging could
not be performed. Thus, these experiments were performed for 39,718
modality tags. The results are shown in Tables 1 to 4. "random 300"
indicates the precisions for 300 tags extracted randomly from among the
tags corrected by our system, and "top X" indicates the precisions for
the top X tags sorted by Method 1 or Method 2. "Precision for detection"
indicates the percentage of the tags corrected by our system for which
the detection of an error succeeded (the original tag was indeed
incorrect), while "Precision for correction" indicates the percentage of
the tags corrected by our system for which the correction of the error
also succeeded (the newly assigned tag was correct).
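The two precisions can be computed as in the following Python sketch, assuming each candidate has been checked by an annotator and recorded as a pair of booleans (detection succeeded, correction succeeded); this record format and the function name are hypothetical. The percentages in the tables appear to be truncated rather than rounded (for example, 112/150 is reported as 74%), so the sketch uses integer division. When fewer than X candidates exist, all of them are used, which is consistent with the 184 denominators in the "top 200" rows of Table 1.

```python
def precision_at(judgements, x):
    """Return (detection %, correction %) for the top x candidates.

    judgements: list of (detection_ok, correction_ok) booleans, already
    sorted by descending confidence value.
    """
    top = judgements[:x]
    n = len(top)
    if n == 0:
        return (0, 0)
    detected = sum(1 for d, _ in top if d)
    corrected = sum(1 for d, c in top if d and c)
    # Integer division truncates, matching the percentages in the tables.
    return (100 * detected // n, 100 * corrected // n)
```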

We came to the following conclusions 
based on the experimental results. 

• Throughout all the experiments, the 

⁵ When the probabilities are calculated using open data in the
decision-list method, the probability of the category of the original
tag is apt to be 0, or the probability of the category of the tag
defined after corpus correction is apt to be 1, because the calculation
is performed without using the original tag. When there are many such
tags, many of them have the same probability and sorting by probability
becomes difficult. In this case, among tags with the same confidence
value for corpus correction, we rank higher those whose probability was
calculated from a feature that occurs with many tags.



precisions for detection and correction 
were almost the same. Thus, we found 
that it is more convenient to perform 
both correction and detection, rather 
than only detection. 

From the viewpoint of manual modification, when we modify tags by hand,
it is also more convenient for the system to propose the candidate
category that would be assigned by corpus correction. This is because we
can see how the original tag was incorrect and how we should change it
to the new corrected tag. When only detection is performed, in other
words, when a candidate category is not presented, an annotator may not
know why the tag is incorrect.

• In general, the maximum-entropy method produced higher precision
than the decision-list method. However, when the closed data was used
to calculate the probabilities, the precisions of the top items were
almost the same for the two methods.

• In terms of the precisions of top items, 
using the closed data to calculate the 
probabilities was better than using 
the open data. However, in terms of 
the total number of extracted items, 
using the open data was better. 

• In terms of sorting by Method 1 or 
Method 2, Method 1 generally pro- 
duced higher precisions for the top 
items than Method 2. 

• In terms of comparing "random 300" 
and "top X" , "top X" produced much 
higher precisions for the top items 
than "random 300". We thus found 
that sorting by confidence values of 
corpus correction is very important. 

Based on the above results, we think 



Table 1: Precision of corpus correction using the maximum-entropy method
(The probabilities were calculated using the closed data. 184 candidate
errors were extracted.)

                     Precision for detection   Precision for correction
random 300             69% (127/184)             68% (126/184)
Method 1  top  50     100% ( 50/ 50)            100% ( 50/ 50)
          top 100      92% ( 92/100)             92% ( 92/100)
          top 150      77% (116/150)             77% (116/150)
          top 200      69% (127/184)             68% (126/184)
          top 250       -                         -
          top 300       -                         -
Method 2  top  50      88% ( 44/ 50)             88% ( 44/ 50)
          top 100      81% ( 81/100)             81% ( 81/100)
          top 150      74% (112/150)             74% (111/150)
          top 200      69% (127/184)             68% (126/184)
          top 250       -                         -
          top 300       -                         -

Table 2: Precision of corpus correction using the maximum-entropy method
(The probabilities were calculated using the open data. 694 candidate
errors were extracted.)

                     Precision for detection   Precision for correction
random 300             28% ( 84/300)             26% ( 78/300)
Method 1  top  50      88% ( 44/ 50)             88% ( 44/ 50)
          top 100      88% ( 88/100)             88% ( 88/100)
          top 150      80% (121/150)             79% (119/150)
          top 200      68% (136/200)             67% (134/200)
          top 250      60% (151/250)             59% (149/250)
          top 300      53% (160/300)             52% (157/300)
Method 2  top  50      72% ( 36/ 50)             72% ( 36/ 50)
          top 100      74% ( 74/100)             71% ( 71/100)
          top 150      70% (106/150)             68% (102/150)
          top 200      67% (135/200)             65% (131/200)
          top 250      60% (152/250)             58% (147/250)
          top 300      52% (157/300)             50% (152/300)



Table 3: Precision of corpus correction using the decision-list method
(The probabilities were calculated using the closed data. 383 candidate
errors were extracted.)

                     Precision for detection   Precision for correction
random 300             34% (104/300)             33% (101/300)
Method 1  top  50     100% ( 50/ 50)            100% ( 50/ 50)
          top 100      92% ( 92/100)             92% ( 92/100)
          top 150      76% (115/150)             74% (112/150)
          top 200      62% (124/200)             60% (121/200)
          top 250      51% (128/250)             50% (125/250)
          top 300      44% (132/300)             43% (129/300)
Method 2  top  50      88% ( 44/ 50)             86% ( 43/ 50)
          top 100      86% ( 86/100)             84% ( 84/100)
          top 150      71% (107/150)             69% (104/150)
          top 200      59% (118/200)             57% (115/200)
          top 250      50% (126/250)             49% (123/250)
          top 300      43% (129/300)             42% (126/300)



Table 4: Precision of corpus correction using the decision-list method
(The probabilities were calculated using the open data. 694 candidate
errors were extracted.)

                     Precision for detection   Precision for correction
random 300              6% ( 18/300)              6% ( 18/300)
Method 1  top  50      56% ( 28/ 50)             52% ( 26/ 50)
          top 100      43% ( 43/100)             40% ( 40/100)
          top 150      31% ( 47/150)             29% ( 44/150)
          top 200      26% ( 52/200)             24% ( 48/200)
          top 250      22% ( 55/250)             20% ( 51/250)
          top 300      20% ( 61/300)             19% ( 57/300)
Method 2  top  50      66% ( 33/ 50)             64% ( 32/ 50)
          top 100      48% ( 48/100)             46% ( 46/100)
          top 150      44% ( 66/150)             42% ( 63/150)
          top 200      35% ( 71/200)             34% ( 68/200)
          top 250      30% ( 77/250)             29% ( 73/250)
          top 300      26% ( 80/300)             25% ( 76/300)



that the following strategy is a better solution.

1. We first perform high-quality corpus 
correction by using the probability 
calculation for the closed data and 
Method 1. 

2. Next, we perform corpus correction 
for a much larger number of tags by 
using the probability calculation for 
the open data, the maximum-entropy 
method, and Method 1. 

5 Conclusion 

In this paper, we have described corpus correction using a
machine-learning method for a modality corpus for machine translation.
We have constructed a high-quality modality corpus by using corpus
correction. In the future, we plan to research Japanese-English
translation of tense, aspect, and modality by using this corpus.

Our method of corpus correction has the 
following advantages. 

• There is no previous paper on error correction in corpora. In terms
of error detection in corpora, there has been research using boosting or
anomaly detection (Abney et al., 1999; Eskin, 2000). We found that the
precisions for detection and correction were almost the same. Therefore,
we should perform correction in addition to detection.

• Our method calculates the probability of each tag and can sort the
error candidates in the corpus by using these probabilities as
confidence values for corpus correction. Thus, we can begin by
correcting the errors for which the confidence value is higher.

• Our method uses the machine-learning method and inherits its original
advantages.

  - Our method has the same wide applicability as the machine-learning
    method and can be used to correct various types of corpora.

  - A large amount of human effort is not necessary, and human beings
    only have to provide appropriate feature sets used in the
    machine-learning method.